Data training in multi-sensor setups

ABSTRACT

A system and method for constructing training dictionaries with multichannel information. An exemplary method takes into account the effect of the acoustic path while training multichannel acoustic data. A method that uses different time-frequency resolutions in machine learning training is also presented.

RELATED APPLICATION

This application claims the benefit of and priority under 35 U.S.C. §119(e) to U.S. Patent Application No. 62/170,793, filed Jun. 4, 2015, entitled “Data Training in Multi-Sensor Setups,” which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

Various embodiments of the present application relate to training methods for supervised or semi-supervised machine learning applications. Aspects also relate to improving all fields of signal processing, including but not limited to speech, audio and image processing, radar processing, biomedical signal processing, medical imaging, communications, multimedia processing, forensics, machine learning, data mining, etc.

BACKGROUND

Machine learning is important in the signal processing field. There are many tasks that can be performed by machine learning methods such as classification, regression, clustering, dimensionality reduction, etc. In the case of supervised or semi-supervised learning methods, a complete or incomplete training dictionary is required. Supervised and semi-supervised approaches take advantage of training information (often in the form of training dictionaries) to improve performance, accelerate convergence or ensure convergence in the iterative algorithms that are often used in machine learning applications. Often these applications involve seeking solutions using iterative algorithms where a single global optimum solution does not exist and instead a number of saddle-point solutions (or local optima) can be found during the iterative process. Supervised and semi-supervised approaches introduce information based on training data into the iterative algorithm, often in the form of initial states or initial conditions, in order to cause the algorithm to converge to a desirable choice of saddle point or to a desirable local optimum. Given that for many real-life applications training data are indeed available, there's an opportunity for new methods and systems to produce intelligent training dictionaries that can then be used to at least improve the performance, efficiency and operation of machine learning algorithms.

In accordance with one exemplary embodiment, a method is presented that enables the generation of such training dictionaries, in particular for source separation techniques that use non-negative matrix factorization (NMF) approaches. The performance of NMF methods depends on the application field and also on the specific details of the problem under examination. In principle, NMF is a signal decomposition approach and it attempts to approximate a non-negative matrix V as a product of two non-negative matrices W (the matrix of bases) and H (the matrix of activation functions). To achieve the approximation, a distance or error function between V and WH is constructed and minimized. In the most general case, the matrices W and H are randomly initialized. However, in order to improve performance and ensure convergence to a meaningful, desirable or useful factorization (the desirable “saddle point” or local optimum), the use of a training step and training data can be employed. Such methods that include a training step are referred to as supervised or semi-supervised NMF.
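As an illustration of the decomposition just described, the sketch below implements the standard multiplicative-update rules for NMF under a Euclidean cost. This is a minimal sketch, not the claimed method: the function name, iteration count and random initialization are illustrative choices. A supervised variant would simply replace the random initialization of W with a training dictionary.

```python
import numpy as np

def nmf(V, K, n_iter=200, W_init=None, seed=0):
    """Approximate the non-negative matrix V (F x T) as W (F x K) times H (K x T).

    W_init, if given, plays the role of a training dictionary used to
    initialize (and thus steer) the iterative factorization.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = W_init.copy() if W_init is not None else rng.random((F, K)) + 1e-3
    H = rng.random((K, T)) + 1e-3
    eps = 1e-12  # avoids division by zero in the update rules
    for _ in range(n_iter):
        # Multiplicative updates that monotonically decrease ||V - WH||_F^2
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```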

During the last decades both (i) the available computation that can be allocated to signal processing applications, and (ii) the number of available sensors that gather data, have continuously increased. Thus, more information is readily available, as is the processing power to take advantage of it. However, many traditional signal processing techniques are designed and contemplated only for single-sensor signals. The availability and use of multi-sensor information can significantly improve the performance of signal processing tasks. Therefore, there's a need for new signal-processing methods and systems that explore multi-sensor information.

Live music events and studio recordings are examples where signal processing is usually performed on single microphone signals, despite the fact that inputs from many microphones are simultaneously available. In a typical live music event, dozens or even hundreds of signal inputs might be simultaneously available. Despite the fact that all these sound inputs are gathered and processed at a single location (for example at the main mixer), there are no inherent multichannel signal processing methods available to sound engineers. In addition, there are other cases where multi-microphone inputs are available simultaneously, including but not limited to recording studios, hearing assistive and hearing aid devices, mobile phones, active ear protection systems, public address systems, teleconference and conference systems, hands-free devices, automatic speech recognition systems, multimedia software and systems, systems for professional audio, DECT phones, desktop or laptop computers, tablets, etc.

Therefore, there is a need for new and improved signal processing methods and systems that take into account the multichannel information in multi-microphone setups and, in general, in multi-sensor environments, where a sensor may be any passive or active device (or combination thereof) that is used for capturing, reading, measuring and/or detecting one or more signals (including audio signals, speech signals, images, videos, communications signals such as wireless, radio waves, optical signals and/or the like).

A typical trade-off for most signal processing methods is the choice of the time-frequency resolution. According to Heisenberg's uncertainty principle, a signal cannot be sharply localized simultaneously in time and in frequency. In a more general form, the uncertainty principle asserts a fundamental limit to the precision with which certain pairs of physical properties, known as complementary variables, can be known simultaneously. This limitation can be important during the training phase of machine learning algorithms where both complementary variables (for example, time and frequency signal data) are important and must be accurately captured. Hence, there is a need for methods and systems that deal with the uncertainty principle during the training phase of machine learning methods, by allowing multiple time-frequency representations to be considered simultaneously.

In the art, while the use of training data to assist in the convergence of iterative algorithms has been discussed, the capture and use of useful multi-sensor information is not taken into account in training machine learning algorithms. Neither is the simultaneous use of training signals that represent multiple time-frequency resolutions. For example, U.S. Pat. No. 8,015,003 B2, to Wilson et al. (which is incorporated herein by reference in its entirety) presents a method for “Denoising acoustic signals using constrained non-negative matrix factorization”. In this patent, the training signals are “representative of the type of signals to be denoised” and both noise and speech are represented by corresponding training dictionaries.

However, multi-sensor information is not taken into account, nor are any precautions taken for the uncertainty limitations, and no description of training signals for multi-sensor environments is provided. In “Single channel speech music separation using nonnegative matrix factorization and spectral masks,” 7th International Conference on Digital Signal Processing (doi: 10.1109/ICDSP.2011.6004924) (which is incorporated herein by reference in its entirety), Grais and Erdogan use NMF for separating speech from music. To facilitate training, they use copies of speech utterances from the test speaker and recordings of piano pieces from the same artist. Again the authors make no explicit use of multi-sensor information and provide no solution for dealing with the challenges posed by the uncertainty principle (for example time-frequency limitations).

In “Single-channel speech separation using sparse non-negative matrix factorization” (Interspeech 2006) (which is incorporated herein by reference in its entirety), Schmidt and Olsson use two ways to learn speech dictionaries: (a) by using a large training data set of a single speaker, or (b) by segmenting the training data according to phoneme labels.

Again, no multi-sensor information is used and no effort for reducing the limitations of time-frequency uncertainty is made. As can be seen from these and other related works, the primary purpose for using training signals in NMF is to provide at least starting points for the matrices that are used in the decomposition (the W or H matrices described above) so as to accelerate or improve convergence to an iterative solution. Typically, training is accomplished either by using a dataset of signals having common characteristics with the “desired” signal or by using a version of the “desired” signal itself. Expanding training dictionaries to include the use of multi-sensor information as well as to cope with the time-frequency analysis limitations in machine learning training, and more specifically in NMF training, is a primary goal of the methods and systems disclosed in this invention.

As discussed above, training signals can take the form of prerecorded audio or speech signals in audio applications. They can also be previously captured signals or subsets of signals (or signals captured during a training phase where certain signals are intentionally not present, for example), where the signals are images, video or wireline or wireless communications signals.

In general, there is a need for creating intelligent training dictionaries that enable the rapid and useful convergence of iterative machine learning techniques. An exemplary embodiment presents new methods to improve training dictionaries by taking into account multi-sensor and multi-resolution information that is available in many applications.

These training signals or dictionaries can then be used as starting points in subsequent machine learning iterative algorithms to improve them. During this phase, the data being analyzed is no longer training data and the purpose of the machine learning algorithm is to analyze the non-training data via separation, classification, regression, clustering, dimensionality reduction, etc. Non-training data is any data that is not controlled, known in advance or determinable. As an example, in musical performances, an instrument sound check is controlled and the existence of a solo in a recording is known in advance (or can be determined by detecting it during listening). The data or signals recorded during these times can be classified as training data or signals. Any other data or signals captured during a musical performance are non-training signals. These are the data or signals upon which methods involving separation, classification, regression, clustering, dimensionality reduction, etc., are performed using the training dictionaries determined with the training data or signals.

SUMMARY

Aspects relate to a method that uses multichannel information while training machine learning methods.

Aspects also relate to a method that improves a training dictionary in multi-sensor scenarios.

Aspects also relate to a method that takes into account the effect of the acoustic path while training multichannel acoustic data.

Aspects also relate to methods that cope with time-frequency limitations for training machine learning algorithms.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:

FIG. 1 illustrates an exemplary schematic representation of a multi-microphone setup;

FIG. 2 illustrates an exemplary schematic representation of a multi-microphone and instrument setup for a music event;

FIG. 3 illustrates an exemplary schematic representation of a sound engineering setup for capturing a drum kit;

FIG. 4 illustrates an exemplary schematic representation of the effect of different acoustic paths in a drum sound;

FIG. 5 illustrates an exemplary schematic representation of a method for obtaining multichannel training data from recordings;

FIG. 6 illustrates an exemplary schematic representation of a method for obtaining multichannel training data in a live event;

FIG. 7 illustrates an exemplary schematic representation of a method that builds a training dictionary;

FIG. 7A illustrates an exemplary schematic representation of a method that builds a training dictionary for drums;

FIG. 7B illustrates an exemplary schematic representation of a signal's magnitude spectrogram with two different time-frequency resolutions;

FIG. 8 illustrates an exemplary schematic representation of a method that combines training obtained from different time-frequency resolutions;

FIG. 8A illustrates an exemplary schematic representation of an embodiment of the invention for dual time-frequency resolution training;

FIG. 8B illustrates an exemplary method to generate W matrices;

FIG. 9 illustrates an exemplary schematic representation of another method that combines training obtained from different time-frequency resolutions;

FIG. 9A illustrates an exemplary schematic representation of an embodiment corresponding to the case of P=2;

FIG. 10 illustrates an exemplary representation of a multi-source, multi-sensor setup used to generate an improved training dictionary in the form of matrix W; and

FIG. 11 illustrates an exemplary embodiment of how the improved training dictionary is used in order to improve tasks such as source separation in live music events or studio recordings.

DETAILED DESCRIPTION

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present disclosure.

The exemplary systems and methods will sometimes be described in relation to audio systems. However, to avoid unnecessarily obscuring the present invention, the following description omits well-known structures and devices that may be shown in block diagram form or otherwise summarized.

For purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the technology. It should be appreciated, however, that the techniques herein may be practiced in a variety of ways beyond the specific details set forth herein. The terms determine, calculate and compute, and variations thereof, as used herein are used interchangeably and include any type of methodology, process, mathematical operation or technique.

Capturing Training Data from Multiple Sensors: In the next paragraphs, exemplary scenarios are described in which multi-sensor data is available, along with methods for capturing such data so that it can be used to produce intelligent training dictionaries. Note that in general, a training dictionary captures inherent characteristics of the source data (for example spectral characteristics). Therefore, a training dictionary can be more useful when multi-sensor characteristics of data are captured.

FIG. 1 shows an exemplary embodiment of a multi-microphone setup where 4 microphones 102, 104, 106, 107 acquire the sound of 3 sources 101, 103 and 105. Three out of four microphones (102, 104 and 106) are meant to mainly capture the sound of individual sources (101, 103 and 105 respectively) and these can typically be called close microphones. On the other hand, microphone 107, which can be referred to as an ambient microphone, is not meant to capture a specific sound source, but rather the complete soundscape. In some embodiments, a close microphone can be an electromagnetic microphone or any type of microphone with lower sensitivity, while an ambient microphone can be a condenser microphone or any type of microphone with higher sensitivity. Despite their intended usage, all microphones capture not only the sound of their source of interest but also, to some extent, the sound of all (or some) of the other sources. This phenomenon is called microphone leakage (or microphone bleed or microphone spill) and is prominent in every real-life multi-microphone setup. However, the captured sound of each source is not the same in each microphone due to the different signal paths. Here, the term signal path can refer to the acoustic path, i.e., the path the sound followed between the source location and the microphone at which the acoustic signal is captured. The effect of the acoustic path is to cause the source signal to change from the moment it is produced at the source location until it is captured by the microphone. Such changes include but are not limited to the attenuation from the source-microphone distance and the transfer medium, the effect of individual reflections, the room reverberation, etc. There is a unique acoustic path between a given sound source location and a given sensor position.

The previous description of a signal path can be extended to communications paths as well, where the communications path is that between the communications source and the communications receiver. The communications path takes into account the communications channel, channel noise (thermal or other) and interference from other communications sources, and the effects of signal fading that may occur because of multipath reflections. In other embodiments, the signal path refers to any transformation and/or processing that occurs in the source signal after being produced and before being captured by a sensor. The signal path can imply convolution and/or addition or any other transformation.

FIG. 2 shows an example of a multi-microphone setup for an exemplary live or studio music event. A number of microphones 202, 203 are used to record a drum set 201 and microphones 205, 207 and 209 are used to record a bass 204, a guitar 206 and a singer 208, respectively. As previously discussed, all microphones capture all sound sources due to the microphone leakage phenomenon. Since the acoustic paths from each source to each microphone are not the same, the captured sound of each sound source will be different in each microphone. For example, the sound of the guitar 206 will typically be captured more clearly by the close microphone 207, where the effect of the acoustic path is small or even negligible. On the other hand, all other microphones 202, 203, 205 and 209 also capture the guitar sound. However, these other microphones capture somewhat different versions of the guitar source due to the different acoustic paths. From a sound engineering perspective, all these different versions of a captured sound can be useful and can be involved in the final sound mix or arrangement. Therefore, the unique characteristics of each one of the infinite acoustic paths can be important from the sound engineering perspective and this is why a sound engineer may sometimes place multiple microphones just to capture the same source.

In the art, in order to perform training or provide a reference for an algorithm that would process, for example, the guitar source 206, one could use: (a) representative recordings from any similar guitar, (b) recordings of the actual guitar 206 captured from a random microphone, or (c) recordings of the actual guitar 206 from a dedicated microphone (this would typically be the close microphone 207). As described above, a new form of training is described that utilizes sound captured across multiple of the setup microphones 202, 203, 205, 207, 209.

In FIG. 3, an exemplary setup of a drum kit and corresponding microphones is presented. A drum kit is usually built from several drums and cymbals. Dedicated microphones are usually placed in order to capture the sound of some of the individual sound sources. For example, in FIG. 3 the sounds of the kick drum 303, the snare drum 304, the floor tom 302, the mid tom 307 and the high tom 309 are captured by microphones 314, 305, 301, 308, 310, respectively. This exemplary drum kit also contains several cymbals: a hi-hat 306, a ride 312 and 2 crash cymbals, 313 and 311, that aren't associated with a close microphone. There are also 2 ambient microphones 315 and 316, which can typically capture the acoustic image of the drum kit as a whole. It's important to note that even for the case of a drum kit, where the sound sources are relatively close to one another, the effect of the acoustic paths on the captured sound is very significant. Therefore, the sound of each sound source will be significantly different in each one of the microphones.

FIG. 4 shows an illustrative example of the effect of different acoustic paths for a sound recording (for example a kick drum). In 401 the time-domain signal as recorded from the close microphone is shown. In 402 and 403 the recordings from 2 other microphones are shown. The effect of the acoustic path is clear even in the time domain, since the different source-receiver distances have resulted in a different amplitude for the three recordings 401, 402, 403. The effect is also prominent in the frequency domain, where the effect of individual reflections from nearby surfaces and the room reverberation will also be visible. The unique qualities of each captured version of this sound signal are important from a sound engineering perspective and can be used by a human or an algorithm in order to provide a superior mixing result.

Traditional training methods of machine learning algorithms did not take into account the effect of the acoustic path. In order, for example, to train a machine learning algorithm for the snare drum 304, one could use any available recording of archetypical snare drums or any available recordings of the specific snare drum. However, exemplary embodiments can take into account the different acoustic paths from the snare drum 304 to one or more of the available microphones 301, 305, 308, 310, 314, 315, 316.

In a specific embodiment, the signal path can be taken into account implicitly by using, in the training phase, one or more sounds captured from the additional microphones (for example microphones 301, 305, 308, 310, 314, 315, 316). In other embodiments, the signal path can be taken into account explicitly by modeling the signal path contribution.

In another embodiment, multi-sensor data representing each individual source can be obtained. These data can be used in any machine learning algorithm, for example in a source separation algorithm. For example, in the case of audio signals, solo recordings of single audio sources in more than one microphone can be obtained. There are many ways to obtain such data and they are all within the scope of the present disclosure. For example, in many music arrangements it's quite common to locate parts where an instrument plays a solo. In these cases, it is possible to obtain the multichannel segment of the solo instrument in all available microphones. FIG. 5 shows an exemplary embodiment where multichannel segments of solo sources 502 are identified in a multichannel recording 501. The identification of the segments of interest can be done manually by a user (e.g., a sound engineer) or automatically via an appropriate algorithm. Then, the solo multichannel segments are separated from the rest of the recording 503 and used for multichannel training of a single source 504. Since each solo source is captured from all available microphones, different acoustic paths are taken into account during training. The proposed approach can also be applied in real-time without having access to the recordings. In exemplary embodiments, a user or an algorithm can turn training on and off during an event and the training dictionary changes in real-time. Note that the training signals or training results may be captured in advance (during a start-up, initialization or sound check timeframe) as well as during steady-state operation or during a performance.
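The segment-selection step of FIG. 5 can be sketched as follows, assuming the multichannel recording is held in a NumPy array of shape (microphones, samples) and that the solo regions have already been identified. The start/end times passed in here are hypothetical placeholders; in practice they would come from a user or a detection algorithm.

```python
import numpy as np

def extract_solo_segments(recording, solo_times, fs):
    """Cut the solo regions of one source out of a multichannel recording.

    recording : array of shape (M, n_samples), one row per microphone
    solo_times: list of (start_sec, end_sec) tuples marking solo passages
    fs        : sample rate in Hz
    Returns one (M, segment_samples) array with the solo material of the
    source as captured by every microphone, ready for multichannel training.
    """
    segments = [recording[:, int(t0 * fs):int(t1 * fs)] for t0, t1 in solo_times]
    return np.concatenate(segments, axis=1)
```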

In other embodiments, dedicated recordings can be made in order to ensure that solo parts of single sources are available. In live or studio setups a sound-check step can precede the actual performance. During the sound-check the sound engineers and technicians prepare the stage, place the microphones, connect and test the equipment, tune the instruments and the sound system, etc. During the sound-check there's typically enough time to capture multichannel recordings from all possible sources (including but not limited to singers' voices, electronic or acoustic musical instruments, monitor speakers, PA speakers, etc.) in all available microphones and then use some or all of them for training machine learning algorithms. The captured multichannel data will contain information not only about the sources of interest but also about all relevant (on- and off-stage) acoustic paths.

In other embodiments, the multichannel data can be immediately used for on-the-fly training without the need for an actual recording, or it can be stored for later use. In other embodiments, the multichannel training data can be obtained in advance, before the live event or studio recording. In other embodiments, the multichannel training data can be used interchangeably between similar microphone and acoustic setups. The training results can be used to train any machine learning algorithm in real-time during the performance or afterwards for post-processing. The sound sources can be triggered by anyone, including but not limited to the actual performers (e.g., singers, musicians), members of the technical crew or other individuals, automatic algorithms, mechanical devices, etc.

In a particular embodiment, multichannel training can be applied in a live event. After the event stage is set up 601, the sound engineers can decide on the positions of the sound sources (for example the musical instruments) 602 and the sound receivers (for example the microphones) 603. In this way some or all relevant acoustic paths can be defined and can be kept relatively unchanged during the live event. Then each sound source can be “triggered” in order to capture the sound in all available microphones 604. The sound can be either recorded or used for on-the-fly machine learning training 605. Then the training results (i.e., the training dictionary) can be used during the live event 606 in any supervised or semi-supervised algorithm (that is, any algorithm that can take advantage of prior knowledge to assist in finding a solution) or even after the live event for post-processing. In the case that the position of a microphone or a source changes during the live event, the acoustic path can also change. In one embodiment, the relevant training results (i.e., the training obtained from this microphone and/or source) can sometimes be omitted by the machine learning algorithm. In a particular embodiment the sound engineer of a live event can select whether certain training results will be taken into account or not via an appropriate interface. In another embodiment, the location of all sources and microphones is monitored by a video camera or any other appropriate device and an algorithm decides dynamically whether certain training results will be used or not.

In another embodiment, multichannel training can be applied in any audio recording application. A group of microphones is used in order to capture one or more sound sources, for example in a professional recording studio or a home studio. The sound of each sound source is captured by any available microphone, resulting in alternate versions of the sound sources due to the different acoustic paths. In a studio and/or recording, it's common that the acoustic paths will not change between the training phase and the recording phase, and therefore embodiments of the present invention can sometimes be applied without controlling the usage of the training results.

The duration as well as the specific characteristics of the multichannel training data can play an important role in whether the produced training dictionary is beneficial for the machine learning task at hand. For example, in a live event or in a recording studio, the duration of the training data of each instrument must be long enough to ensure that all details of the instrument as played by the specific musician will be captured. In addition, it's advantageous to play the instrument in many different ways so that all possible performance variations are captured.

In the previous paragraphs, a number of exemplary scenarios were described in which multi-sensor data is available and can be captured for training purposes. FIG. 10 discloses a number of exemplary embodiments that in general describe capturing training signals that provide information about multiple sources s_(n) (depicted as 10, 11, 12 in FIG. 10) as received by multiple sensors x_(m) (depicted as 13, 14 and 15 in FIG. 10) over multiple acoustic paths a_(mn) (not shown). These training signals are then converted from analog to digital using A/D converters 16, 17 and 18. Note that the signals received at the sensors may also be processed or altered in the analog domain prior to A/D conversion. Subsequent to digitization, processing unit A (19 in FIG. 10), which can include one or more processors, memory, storage and digital signal processor(s), analyzes the digitized sensor data to produce training bases (elements of the training dictionary) that represent the various combinations of sources s_(n) to sensors x_(m) via acoustic paths a_(mn). The training dictionary generated by processing unit A (19 in FIG. 10) is symbolized by the W matrices. Subsequent sections describe a number of embodiments to generate the W matrices. Also described in FIG. 8B is a specific method to generate the W matrices. The generated W matrices are stored, as depicted at 20 in FIG. 10.

FIG. 11 shows an exemplary embodiment of how the W matrices generated as described in FIG. 10 are subsequently used to analyze non-training data to perform any number of tasks, including improved source separation. Audio source data s_(n) (depicted as 50, 51, 52 in FIG. 11) are received by multiple sensors x_(m) (depicted as 53, 54 and 55 in FIG. 11) over multiple acoustic paths a_(mn) (not shown). These audio signals are then converted from analog to digital using A/D converters 56, 57 and 58. Note that the signals received at the sensors may also be processed or altered in the analog domain prior to A/D conversion. The digitized signals are then processed in Processing Unit B (59 in FIG. 11), which uses the W matrix determined in FIG. 10 (as shown at 75 in FIG. 11) and produces output estimates y_(n) (60, 61 and 63 in FIG. 11) of the source signals, i.e., the outputs are the separated sources generated by NMF processing, appropriately converted back to the time domain. In an audio source separation application, these outputs are audible signals that represent each of the source signals s_(n) captured by the sensors x_(m). This process is described in detail in a subsequent section and is depicted within FIG. 8B. Processing Unit A and Processing Unit B may be within a computer, in one or more DSPs, part of a digital audio workstation or part of a sound console, and can be implemented in software or hardware or any combination thereof.
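One way Processing Unit B could apply a previously learned dictionary is sketched below: the mixture spectrogram is factorized with the dictionary W held fixed, so only the activations H are updated, and each source estimate is then reconstructed from its own block of bases. This is a minimal sketch under a Euclidean cost; the function and variable names are illustrative, and the resynthesis back to the time domain (e.g., via masking and an inverse transform) is omitted.

```python
import numpy as np

def separate_with_dictionary(X, W, source_slices, n_iter=200, seed=0):
    """Estimate per-source spectrograms from a mixture spectrogram X.

    X             : non-negative (possibly stacked multichannel) spectrogram, shape (F, T)
    W             : fixed training dictionary, shape (F, K)
    source_slices : dict mapping source name -> slice of columns of W
                    that belong to that source
    Only H is iterated (supervised NMF); W is taken from training.
    """
    rng = np.random.default_rng(seed)
    K, T = W.shape[1], X.shape[1]
    H = rng.random((K, T)) + 1e-3
    eps = 1e-12
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # dictionary W stays fixed
    # Reconstruct each source from its own bases and activations
    return {name: W[:, cols] @ H[cols, :] for name, cols in source_slices.items()}
```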

In an exemplary embodiment, let's consider M microphones capturing N sound sources. The captured sound signals can be in the time domain or transformed to any other appropriate form. For example, one can obtain a time-frequency representation of sound signals by transforming them to the time-frequency domain with any method including but not limited to a short-time Fourier transform (STFT), a wavelet transform, a polyphase filterbank, a multi-rate filterbank, a quadrature mirror filterbank, a warped filterbank, an auditory-inspired filterbank, a tree-structured array of filterbanks, etc. Although embodiments will refer to a spectrogram, it is apparent to anyone skilled in the art that any appropriate representation can be used without limiting the scope of the disclosed technology. All subsequent references to a time-frequency transform or transformation can include any one or more of the above methods.

In a multi-source, multi-sensor setup, let x_(m)(k) be the digital microphone signal of the m-th microphone, where k is the discrete time index. As discussed previously, this signal captures all source signals to some extent, that is:

$\begin{matrix}{{x_{m}(k)} = {\sum\limits_{n = 1}^{N}\; {s_{m,n}(k)}}} & (1)\end{matrix}$

The signal s_(m,n)(k) represents the sound of the n-th source as captured by the m-th microphone. It is understood here that the signal s_(m,n)(k) includes the effect of the acoustic path between the n-th source and the m-th microphone. One can transform the time domain signal x_(m)(k) to the time-frequency domain using any appropriate transform. In one embodiment, the short-time Fourier transform (STFT) is used to obtain the complex values X′_(m)(f, t), where f is the discrete frequency index and t is the time frame index. The magnitude values in some domain δ are obtained as:

X _(m)(f,t)=|X′ _(m)(f,t)|^(δ)  (2)

where δ>0. For each time frame t, the values of X_(m)(f, t) for all frequency bins f can be arranged in a column vector x_(m)(t) of size F×1. All vectors x_(m)(t) can be arranged in a matrix X_(m)∈ℝ₊^(F×T), which is the magnitude spectrogram of the recorded microphone signal x_(m)(k) in domain δ. X_(m) is a non-negative matrix with F rows, where F is the number of discrete frequency bins, and T columns, where T is the number of frames. In the case of the STFT, F is controlled by the FFT size and T is controlled by the hop size. Again, since each microphone captures the sound from all active sources, the spectrogram can be approximately written as:

$\begin{matrix}{X_{m} \approx {\sum\limits_{n = 1}^{N}\; S_{m,n}}} & (3)\end{matrix}$

where S_(m,n)∈ℝ₊^(F×T) is the magnitude spectrogram of the sound of the n-th source as captured by the m-th microphone. Note that all spectrograms in Eq. (3) are in the same domain δ. As discussed in FIGS. 1, 2 and 3, each spectrogram S_(m,n) describes the sound of the n-th source under the effects of the acoustic path between the n-th source and the m-th microphone.

In one embodiment, the sources S_(m,n) can be extracted from the microphone spectrogram X_(m). In order to perform source separation, any appropriate technique can be used, including but not limited to non-negative matrix factorization (NMF), non-negative tensor factorization, independent component analysis, principal component analysis, singular value decomposition, etc. In some embodiments NMF can be used to separate the sources. Each source spectrogram can be expressed as:

S _(m,n) =W _(m,n) H _(m,n)  (4)

where W_(m,n)∈ℝ₊^(F×K_n) is a matrix that contains a set of bases which can describe the spectral properties of the sound of the n-th source as captured by the m-th microphone. Each base is a column of W_(m,n) and describes one fundamental aspect of the sound source in the domain defined by the F discrete frequency bins. Without loss of generality, assume that the first source (n=1) is a kick drum. W_(m,1)H_(m,1) defines a model of the source spectrogram S_(m,1), where K₁ is the chosen order of the model and represents the number of columns in the basis matrix W_(m,1); therefore K₁ defines the number of elements into which the source can be decomposed or separated or split (each of these terms are meant to be used interchangeably herein). In some cases, the order of a model is chosen to be higher for complex sources and lower for simpler sources. If, for example, one works in the time-frequency domain, then S_(m,1) is a spectrogram of the kick drum recording in microphone m and W_(m,1) are spectral representations of one or more elements of the kick drum sound as captured from microphone m. For each of the T audio frames, the matrix H_(m,n)∈ℝ₊^(K_n×T) contains the activation functions (or gains or weights) for each basis function. Each row of H_(m,n) indicates how active the corresponding column of W_(m,n) is in that particular time frame. When the basis functions of W_(m,n) are combined according to the activation functions of H_(m,n), an estimate of the spectrogram S_(m,n) is produced.

In order to perform training in supervised or semi-supervised NMF, specific prior knowledge of one or more sound sources can be leveraged. This prior knowledge can sometimes provide a prior or initial estimate of one or more of the corresponding basis functions in W_(m,n)∈ℝ₊^(F×K_n). Estimates of the basis functions can be called a training dictionary.

As described above, in prior systems multi-sensor information is not explicitly taken into account while training, and the effect of the acoustic path is effectively neglected. In many applications there is only one signal path of the source signal that's interesting for the user. For example, in the case of medical imaging, there's only one “true” representation of a source signal that corresponds best to the physical reality. In other applications such as music-related applications, each version of a sound source as captured from different microphones can be useful and open up new creative possibilities for the musicians/sound engineers. This is because the acoustic path (although it's sometimes considered a sound distortion) can become an inherent element of the sound and contribute to the auditory experience of listening to the sounds. Inspired by this idea, the present technology extends the notion of multichannel training to allow for expansion in the feature domain so that sounds from a source that are captured in each microphone are considered. An example of this is where a microphone other than the singer's microphone will pick up the singer's voice (e.g., the guitar microphone that is nearby). This introduction of new degrees of freedom in the choice of basis functions expands the training dictionary used to assist in convergence of iterative algorithms (such as, but not limited to, NMF). The new training dictionary includes basis functions (or bases) that account for the specific acoustic paths and bases that are dependent upon the location of the sources and sensors and the fact that each sensor may have relevant information about multiple sources.

In effect, training data for many individual source-sensor pairs can be produced and therefore the technology allows the expansion of the feature domain and the obtaining of features that are tailored to the multi-sensor environment that one is encountering. In particular embodiments this can be done by using solo recordings of some or all of the sound sources in one or more of the available microphones, using the methods provided in FIGS. 5 and 6.

In another embodiment, the multichannel training data obtained during the training phase can be processed before producing the training dictionary. In some embodiments it is beneficial to identify and remove silence parts from the training data before using them to produce the training dictionary. The silence removal procedure can be performed automatically or by a user, and can be performed in the time domain, in the time-frequency domain or in any other domain. The motivation for removing the silence parts before producing the training dictionary is that silence is not a representative characteristic of the training data that one necessarily wants to capture and might skew the training dictionary to contain non-relevant information.

In one embodiment, a tensor unfolding technique is used to account for multichannel (or multi-sensor) information. In this case, the following observation can be made: each microphone records all of the sound sources approximately at the same time instant. That is, when the n-th source is active, it is active in all microphones at the same time. This holds for reasonable distances between microphones, so that the time difference between each microphone fits within one time frame. Hence H_(i,n)=H_(j,n) ∀ i,j=1, 2, . . . , M and thus Eq. (4) becomes:

S _(m,n) =W _(m,n) H _(n)  (5)

and therefore the activation functions for each source H_(n) are common across all M microphones.

In addition, a matrix W_(m) is defined, which contains the set of bases (or basis functions) that describe all the sound sources as captured by the m-th microphone:

W _(m) =[W _(m,1) W _(m,2) . . . W _(m,N)]  (6)

The matrix W_(m) is of size F×K (where K=Σ_(n)K_(n)). As discussed above, each matrix W_(m,n) includes K_(n) bases that describe the n-th source as captured by the m-th microphone. Hence, the matrix W_(m) contains all the bases that describe how all of the N sources are captured by the m-th microphone. In addition, we can define the matrix H of size K×T, which contains the gains for the basis functions in W_(m):

H=[H ₁ ^(T) H ₂ ^(T) . . . H _(N) ^(T)]^(T)  (7)

By combining (3), (5), (6) and (7) we have:

X _(m) =W _(m) H  (8)

Therefore the matrix W_(m) captures the spectral properties of each sound source in microphone m, while H captures the corresponding time-domain activations. Now, let us formulate the multichannel spectrogram as:

X=[X ₁ ^(T) X ₂ ^(T) . . . X _(M) ^(T)]^(T)  (9)

The multichannel spectrogram is a collection of the individual channel spectrograms and reflects the time-frequency characteristics of all sources as captured by all microphones. Then (8) can be written as:

X=WH  (10)

where W∈ℝ₊^(MF×K) can be written as:

W=[W ₁ ^(T) W ₂ ^(T) . . . W _(M) ^(T)]^(T)  (11)

Since each matrix W_(m) describes how all sources are captured in each microphone m, the multichannel basis matrix (i.e., the dictionary matrix) W describes how all sources are captured by all microphones. W has a well-defined structure and can be written as a block matrix. In an exemplary embodiment, the multi-sensor training scheme presented herein can be applied to the tensor unfolding scenario by combining (6) and (11):

$\begin{matrix}{\overset{\sim}{W} = \begin{bmatrix}W_{1,1} & W_{1,2} & \cdots & W_{1,N} \\W_{2,1} & W_{2,2} & \cdots & W_{2,N} \\\vdots & \vdots & \ddots & \vdots \\W_{M,1} & W_{M,2} & \cdots & W_{M,N}\end{bmatrix}} & (12)\end{matrix}$

Each submatrix W_(m,n) contains the set of bases that describe how the n-th source is captured by the m-th microphone. The “columns” of the block matrix W (W_(m,n) for a given n) describe how each source is captured by all microphones, while the “rows” of the block matrix W (W_(m,n) for a given m) describe how all sources are captured by each microphone. The NMF framework which provides the factorization (10) can be semi-supervised or supervised, where a part of W or all of W, respectively, is known beforehand via some form of training. In some embodiments, more rows or columns can be added to W in order to form a new dictionary matrix. The columns or rows can be initialized with any appropriate method. These extra rows or columns can sometimes account for characteristics that are not captured at the training phase.
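Given per-source, per-microphone basis matrices W_(m,n), the block dictionary of Eq. (12) can be assembled by simple concatenation, as sketched below. The nested-list layout of the submatrices is an illustrative assumption.

```python
import numpy as np

def build_block_dictionary(W_blocks):
    """Assemble the multichannel dictionary of Eq. (12).

    W_blocks[m][n] is the F x K_n basis matrix W_(m,n) for source n
    as captured by microphone m. The result has shape (M*F, sum_n K_n):
    rows are grouped by microphone, columns are grouped by source.
    """
    rows = [np.hstack(per_mic_blocks) for per_mic_blocks in W_blocks]  # each row is W_m of Eq. (6)
    return np.vstack(rows)                                             # stack over microphones, Eq. (11)
```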

In other embodiments, means to obtain the blocks of the matrix W are provided. As a first step, one can obtain the multichannel spectrogram X ^(n) when only the n-th source is active. This can be done with any method, including but not limited to the methods discussed in FIGS. 4, 5 and 6. Then one can factorize the multichannel spectrogram as:

X ^(n) =W ^(n) H  (13)

where W ^(n) is the n-th “column” of the block matrix (12), that is, all submatrices W_(m,n) for a given n and m=1, 2, . . . , M.

W ^(n) =[W _(1,n) ^(T) W _(2,n) ^(T) . . . W _(M,n) ^(T)]^(T)  (14)

In general, W ^(n) can be interpreted as a dictionary that describes the sound of the n-th source in different microphones. The information in matrix H can be used to further constrain the analysis NMF problem or can be discarded. In another embodiment, W ^(n) can be equal to X ^(n), or it can be formed as any appropriate submatrix of X ^(n). Any method that extracts all or part of the training dictionary from X ^(n) is in the scope of the present invention. Note that during training not all sources may be available. If so, one can initialize the missing “columns” of the block matrix W with any appropriate method. In other embodiments, the multi-sensor training methods described herein can be combined with traditional training techniques, where, for example, a single-channel NMF can be performed in order to obtain each element of the matrix W of (12).
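Training the n-th “column” of the block dictionary, as in Eqs. (13)-(14), could be sketched as follows: the solo spectrograms of source n from all M microphones are stacked vertically and factorized jointly, and the resulting bases are split back into the per-microphone blocks W_(m,n). The nmf helper from the earlier sketch is assumed; this is one possible realization, not the only one contemplated.

```python
import numpy as np

def train_source_dictionary(solo_spectrograms, K_n):
    """Learn W^(n) from solo recordings of the n-th source in all microphones.

    solo_spectrograms : list of M non-negative arrays, each F x T
                        (the spectrogram of the solo source in microphone m)
    K_n               : number of bases for this source
    Returns the list [W_(1,n), ..., W_(M,n)], each of shape F x K_n.
    """
    F = solo_spectrograms[0].shape[0]
    X_n = np.vstack(solo_spectrograms)      # stacked multichannel spectrogram of the solo source
    W_n, _ = nmf(X_n, K_n)                  # Eq. (13); the activations H are discarded here
    M = len(solo_spectrograms)
    return [W_n[m * F:(m + 1) * F, :] for m in range(M)]   # split into W_(m,n), Eq. (14)
```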

FIG. 7 presents an exemplary embodiment, where multi-sensor data from one or more sources are used for producing improved intelligent training dictionaries that capture multi-sensor information. At first, multi-sensor data from one source is obtained 701 and used to extract the dictionary elements that correspond to that source with any appropriate method 702. This procedure is repeated 703 for all sources for which multi-sensor data is available. In other embodiments, one might choose to not use all available data and perform training only on the most significant data. Then all the elements from different sources are used to build the dictionary matrix 704. Finally, if there's a need, more columns or rows can be added to the matrix 705.

In an exemplary embodiment, the multichannel training procedure is applied to drums. Drums are typically captured by more than one microphone and therefore it can be beneficial to use multichannel training. For the case of drums, in some embodiments the drummer plays single drum hits of one or more drum elements, which are captured by one or more microphones and stored in a storage unit. The recordings can then be used to produce a training dictionary. In other embodiments, the drummer plays actual playing variations of a single drum element, which are recorded and stored in a storage unit/device/system(s), and these recordings can then be used to produce a training dictionary. In other embodiments, the drummer is required to play both single hits and actual playing variations of one or more of the drum elements, which are recorded and stored in a storage unit. The above single or multichannel drum element recordings, as captured by one or more of the available microphones, can be used to produce one or more training dictionaries.

In another embodiment, the multi-sensor training procedure described herein can be applied directly in a non-negative tensor factorization (NTF) framework. Instead of “stacking” microphone spectrograms X_(m) as in Eq. (9), consider creating a 3^(rd) order tensor X with dimensions F×N×M. Any NTF model is in the scope of the present embodiment, although for exemplary reasons the exemplary embodiment uses the PARAFAC model (see Section 1.5.2 in A. Cichocki, R. Zdunek, A. H. Phan, S.-I. Amari, “Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation”, John Wiley & Sons, 2009) (which is incorporated herein by reference in its entirety). This model is written for each element of the involved matrices as:

$\begin{matrix}{x_{fnm} \approx {\sum\limits_{k = 1}^{K}\; {w_{fk}h_{kn}q_{km}}}} & (15)\end{matrix}$

where W∈ℝ₊^(F×MK) and H∈ℝ₊^(MK×N) represent the same quantities as in Eq. (4). The matrix q∈ℝ₊^(K×M) represents the contribution (or gain) of each source (or component discovered by the NTF) in each of the M channels/microphones. Consider reshaping the training matrix W ^(n) of Eqs. (13) and (14) as:

{tilde over (W)} ^(n) =[W _(1,n) W _(2,n) . . . W _(M,n)]  (16)

Based on the multi-sensor training matrix {tilde over (W)}^(n) of each of the N sources, one can create a total training matrix {tilde over (W)}∈ℝ₊^(F×MK):

{tilde over (W)}=[{tilde over (W)} ¹ {tilde over (W)} ² . . . {tilde over (W)} ^(N)]  (17)

The matrix of Eq. (17) can be used with Eq. (15) to provide a supervised or semi-supervised NTF of the tensor X.
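The reshaping of Eqs. (16)-(17) amounts to laying the per-microphone blocks of each source side by side and then concatenating over sources. A minimal sketch is given below; the per-source block lists are assumed to come from a training step such as the one sketched earlier.

```python
import numpy as np

def build_ntf_training_matrix(per_source_blocks):
    """Form the total training matrix of Eq. (17).

    per_source_blocks[n] is the list [W_(1,n), ..., W_(M,n)] for source n,
    each block of shape F x K_n. Each source contributes the horizontal
    concatenation of Eq. (16); sources are then concatenated in turn.
    """
    tilde_W_n = [np.hstack(blocks) for blocks in per_source_blocks]  # Eq. (16), one matrix per source
    return np.hstack(tilde_W_n)                                      # Eq. (17), shape F x (M * sum_n K_n)
```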

FIG. 7A presents an exemplary embodiment where multichannel instrument recordings (e.g., recordings of isolated drum elements) are used to produce a multichannel training dictionary. Initially, single hits and/or actual playing variations are captured from a drummer 7001 and stored in an appropriate medium 7002, for example the hard drive of a personal computer. Optionally the silence parts 7003 can be removed, either automatically or manually. After the optional silence removal step, a time-frequency domain transform is performed and the corresponding spectrograms are created 7004 for each microphone signal. Then the microphone spectrograms are stacked 7005 and multichannel NMF is performed 7006. The basis functions that correspond to the specific drum element in each microphone are then extracted 7007 and stored 7008. These basis functions will contain information about the relevant acoustic paths. The procedure is then repeated for every drum element for which one wants to provide a contribution to the training dictionary 7009. Then the basis functions are combined to create the improved multichannel training dictionary 7010. This training dictionary can then be used in, for example, every recording of this drum set, in order to perform machine learning tasks such as source separation in one or more of the available microphones. Although this example has been presented for drum recordings, it's obvious to anyone skilled in the art that the same principle may apply to other instrument recordings, or to any recorded sound in general.

Training Data Captured Using Multiple Time-Frequency Resolutions:

Another exemplary use of the data captured in training (as described above) in order to generate an intelligent training dictionary follows. When calculating any time-frequency transform, the time-frequency resolution is one of the most important trade-offs one has to make, since the Heisenberg-Gabor limit imposes that a function cannot be both time-limited and band-limited. Therefore, signal processing methods behave differently depending on the chosen time-frequency transform and/or resolution. FIG. 7B illustrates an example of the effect of different time-frequency resolutions. 711 shows two sine waves of different frequencies, 300 Hz and 310 Hz, separated by silence (a time gap). Using the STFT and choosing a short window length results in good time resolution and poor frequency resolution or, in other words, the signals are well localized in time and poorly localized in frequency. This is shown in 712 where the time gap between the signals is clearly visible, while their frequency content is spread across several frequency bins and the content of both signals overlaps significantly in frequency. While a signal processing algorithm could detect the two separate events in time easily, it would be more difficult to find out which event corresponds to which frequency content. On the other hand, choosing a long window length results in poor time resolution and good frequency resolution or, in other words, the signals are poorly localized in time and well localized in frequency. This is shown in 713 where the time gap between the two signals is no longer as clear while the frequency content of each signal has become more defined. A signal processing algorithm could easily detect the two different signals in frequency but it would be difficult to estimate when each signal begins and ends.

One of the exemplary purposes of this technology is to describe a new method to relax the requirement to choose a single time-frequency resolution when performing signal processing functions and to overcome the limitations shown in FIG. 7B. In an embodiment, different time-frequency transforms are calculated on the same data, each with a different time-frequency resolution. In general, the time-frequency resolution that a specific time-frequency transform accomplishes is based on the selected values of the parameters of the transform. In the case of the STFT, the chosen window length L dictates the time-frequency resolution. The length L can be defined in samples or time duration. Assume a digital signal x(k) is sampled with a sample rate fs. Performing an STFT on x(k) with a window length of 256 samples and another STFT on the same signal with a window length of 1024 samples will produce different results, as shown in FIG. 7B, due to the different time-frequency resolutions. If the digital signal x(k) was sampled at a higher sample rate of 2 fs, an STFT with a window length of 512 samples would be required to provide the same time-frequency resolution as the window length of 256 samples at fs. These different time-frequency transforms capture different aspects of the original time domain signal and represent them as different distributions of time-frequency energy on a grid of linearly spaced frequency bands and time frames. In order to use these different representations within a single iterative technique, the outputs of the different time-frequency transforms are “mapped” to a new time-frequency domain with a common frequency resolution. Note that the time-frequency grid may not be common between different transforms and this is manageable within the constructs of the training dictionaries as described in more detail below. The time-frequency mapping can be any operation that changes the spacing and/or the number of the frequency bands, including combining certain frequency bands, adding bands, averaging bands, or the like. Note that this operation does not alter the time-frequency resolution but its representation on a specific time-frequency grid. Simply put, it's another way to look at the data produced by a specific time-frequency transform and allows the comparison and common handling of transforms with different time-frequency resolutions. After the mapping, the resulting common frequency bands may be uniform or non-uniform, that is, a single mapped band may represent a larger portion of the overall spectrum than another mapped band. Note that this type of transformation from a time signal to non-uniform frequency bands can also be accomplished with other time-frequency transforms, in addition to the method described above using an STFT followed by a frequency mapping. This is to be understood in the sequel as well, where we describe in detail the STFT-followed-by-frequency-mapping approach, but it is understood that any technique that takes a time domain signal and creates multiple time-frequency representations which capture different aspects of the signal and have a common set of frequency bands can be used.
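Computing the same training signal at several time-frequency resolutions can be as simple as re-running the transform with different window lengths, as sketched below. The window lengths of 256, 1024 and 4096 samples and the 50% overlap are illustrative choices.

```python
import numpy as np
from scipy.signal import stft

def multi_resolution_spectrograms(x, fs, window_lengths=(256, 1024, 4096), delta=1.0):
    """Return one magnitude spectrogram X_p per window length L_p.

    Short windows give fine time / coarse frequency resolution; long windows
    give coarse time / fine frequency resolution (cf. FIG. 7B).
    """
    spectrograms = []
    for L in window_lengths:
        _, _, Z = stft(x, fs=fs, nperseg=L, noverlap=L // 2)
        spectrograms.append(np.abs(Z) ** delta)
    return spectrograms
```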

A signal processing algorithm (such as an NMF decomposition) is applied to the mapped transforms to provide a result that benefits from the fact that information is available regarding multiple time-frequency resolutions. In another embodiment, a signal processing algorithm is applied to the output of each different time-frequency transform. The results of each algorithm are then mapped to a new time-frequency domain with a common frequency resolution and combined.

In a particular embodiment, a training dictionary is created for a sound source which will capture aspects of the source in different time-frequency resolutions simultaneously. Assume that one has at least one training signal x(k) that is, for example, a recording of the sound source of interest. Using the STFT, and similarly to (2), one can construct a set of magnitude spectrograms X_(p)∈ℝ₊^(F_p×T_p) in some domain δ from the training signal x(k), using P STFTs with window lengths L_(p) with p=1, 2, . . . , P. Each spectrogram X_(p) describes the same training signal x(k) in the time-frequency domain with a different time-frequency resolution and captures different aspects of the signal. Note that in other embodiments, any appropriate time-frequency transform can be used to extract multiple spectrograms X_(p) from the same signal x(k) and all are within the scope of the present disclosure.

A set of appropriate frequency “mapping” matrices B_(p)∈ℝ₊^(F_B×F_p) is constructed that can be used to produce a set of new spectrograms V_(p)∈ℝ₊^(F_B×T_p):

V _(p) =B _(p) X _(p)  (18)

The matrix B_(p) maps the set of spectrograms X_(p) to a set of spectrograms V_(p) with a common number of frequency bands F_(B). The set of spectrograms V_(p) describe various aspects of the training signal x(k), made observable under different time-frequency resolutions, in a time-frequency domain with a common frequency resolution. In other embodiments, the set of spectrograms V_(p) can be produced by choosing a common FFT size for all P STFTs. In this case F_(B)=F_(p) for all p.

In a particular embodiment, each V_(p) can produce parts of a training dictionary. For example, one can perform one NMF per matrix V_(p) to obtain a factorization V_(p)=W _(p)H_(p) where W _(p)∈ℝ₊^(F_B×K_p). The result of the different factorizations can be combined as:

{tilde over (W)}=[W ₁ W ₂ . . . W _(p)]  (19)

where {tilde over (W)}∈ℝ₊^(F_B×K′) and K′=Σ_(p=1) ^(P)K_(p). The matrix W _(p) contains a set of basis functions that model the spectral properties of the source of interest with the p-th time-frequency resolution. The number of basis functions K_(p) can be different for each matrix W _(p). This reflects the fact that different time-frequency resolutions bring out different aspects of the sources, which require different modeling parameters. {tilde over (W)} is a matrix that contains the complete set of basis functions in a common frequency domain. These basis functions describe fundamental aspects of a sound source as captured from different time-frequency resolutions of the same training data. Note that in other embodiments, any appropriate method can be used to extract W _(p) from V_(p) and they are all within the scope of the present disclosure.

FIG. 8 shows another exemplary embodiment. The training data x(k) are obtained 801 and the steps included in 800 are used to produce the training dictionary $\tilde{W}$. More specifically, a first STFT transform with window length L₁ is applied 802 to provide the spectrogram X₁. A second STFT transform with a different window length L₂ is applied 803 on the same training data to provide the spectrogram X₂. The process is repeated until the final STFT transform with a window length $L_P$ is applied 804 to provide the magnitude spectrogram $X_P$. Then an appropriate frequency mapping is performed with a different mapping matrix $B_p$ for each STFT 805, 806, 807 to get the set of spectrograms $V_p$. A non-negative matrix factorization is performed 808, 809, 810 on each spectrogram $V_p$ to obtain the basis function matrices $W_p$. The matrices are then combined 811 to provide the training dictionary $\tilde{W}$ 812.

FIG. 8A shows an exemplary embodiment for dual time-frequency resolution training. Note that although the example here is presented for 2 time-frequency resolutions, the presented method is valid for any number of time-frequency resolutions. Typically, the number of chosen frequency resolutions is decided based on the complexity of the sources and the task, and it often comes as a trade-off between better performance and increased computational load. A training signal 821 is captured in the time-domain. A first STFT transform with window length L₁ 822 is applied on the signal to produce the first spectrogram X₁ with a specific time-frequency resolution. A second STFT transform with window length L₂ 823 is applied on the same signal to produce a second spectrogram X₂ with a different time-frequency resolution. These spectrograms are mapped 824, 825 by frequency mapping matrices B₁ and B₂ to a common frequency domain with $F_B$ bands and the spectrograms V₁, V₂ are produced. A first NMF 826 is applied on V₁ to produce W₁. Each column of W₁ describes one of the spectral properties of the training data for the first time-frequency resolution. A second NMF 827 is applied on V₂ to produce W₂. Each column of W₂ describes one of the spectral properties of the training data for the second time-frequency resolution. The two matrices W₁, W₂ can be combined since they describe the spectral properties of the same training data on the same frequency domain but with different time-frequency resolutions. This combination results in the training dictionary $\tilde{W}$. In other methods, improved training dictionaries extracted by the methods shown in FIG. 8 and FIG. 8B can be combined to form a new improved training dictionary.

FIG. 8B shows an exemplary embodiment where the training dictionary $\tilde{W}$ is used in a source separation application for music recordings or live events. A recording x(k) of a first source is obtained 831. This could be, for example, a recording of an acoustic guitar with a microphone. This recording could be obtained during the soundcheck phase of a concert. This signal is used to construct a training dictionary 800 (as also described in detail in FIG. 8). The training dictionary that is generated as described in 800 is designated $\tilde{W}$ and is stored 832 to be used later. Note that the steps described in 831, 800 and 832 are directly related to what is described in FIG. 10. Another recording y(k) that contains the first source and a second source is obtained 833. This could be, for example, a recording of an acoustic guitar and a singer with the same microphone during the performance of an acoustic song. This recording could be obtained during the concert that follows the soundcheck phase. Typically the steps that begin at 833 occur after the dictionary that is created in 832 has been generated. This could also be a recording of a different acoustic guitar, in a different place, or by a different musician. A source separation algorithm will attempt to extract an estimate of the acoustic guitar signal and the voice signal from the microphone signal y(k). The spectrogram Y of the microphone signal is produced using an STFT 834 with a window length $L_y$. This window length may be different from any window length $L_p$ used to obtain the dictionary, or the same as one of the window lengths $L_p$. One can use an appropriate frequency mapping matrix $B_y$ 835 to obtain $Y_B$, which represents the recording 833 in the common time-frequency domain of the training dictionary. Then one can use NMF 836 to factorize $Y_B$ as:

$Y_B = [\tilde{W}\;\; U]\begin{bmatrix} G_W \\ G_U \end{bmatrix}$  (20)

where $\tilde{W} \in \mathbb{R}_+^{F_B \times K'}$ is the training dictionary of the first source, which was stored and can now be used, $U \in \mathbb{R}_+^{F_B \times K_u}$ are the unknown basis functions of the second source, $G_W \in \mathbb{R}_+^{K' \times T_y}$ are the activation functions of the first source, and $G_U \in \mathbb{R}_+^{K_u \times T_y}$ are the activation functions of the second source. U, $G_W$ and $G_U$ are unknown and will be estimated by the NMF. $\tilde{W}$ can remain fixed, or it can be used to initialize the NMF and be further updated by the algorithm. After NMF, we obtain an estimate 837, 838 of each source, $Y_{B1} = \tilde{W} G_W$ and $Y_{B2} = U G_U$. These estimates are mapped back 839, 840 to the original time-frequency domain of the STFT 834 using the transpose of $B_y$. Finally, an inverse STFT 841, 842 is applied to obtain time-domain signals y₁(k) and y₂(k), which are estimates of the first and second source respectively.
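
The following sketch illustrates one possible realization of the factorization in (20) with $\tilde{W}$ held fixed, using multiplicative updates for the Frobenius cost. The cost function, iteration count and function name are assumptions for illustration only, and the back-mapping with the transpose of $B_y$ and the inverse STFT steps (839-842) are not shown.

```python
import numpy as np

def semi_supervised_nmf(Y_B, W_tilde, K_u, n_iter=200, eps=1e-12, seed=0):
    """Factorize Y_B ≈ [W_tilde U][G_W; G_U] (eq. 20), keeping W_tilde fixed.
    Multiplicative updates for the Frobenius cost; one of several possible choices."""
    rng = np.random.default_rng(seed)
    F_B, T_y = Y_B.shape
    K_w = W_tilde.shape[1]
    U = rng.random((F_B, K_u))                   # unknown bases of the second source
    G = rng.random((K_w + K_u, T_y))             # stacked activations [G_W; G_U]
    for _ in range(n_iter):
        W = np.hstack([W_tilde, U])
        G *= (W.T @ Y_B) / (W.T @ W @ G + eps)   # update all activations
        WG = W @ G
        U *= (Y_B @ G[K_w:].T) / (WG @ G[K_w:].T + eps)  # update only the U columns
    G_W, G_U = G[:K_w], G[K_w:]
    Y_B1 = W_tilde @ G_W                         # estimate of the first (trained) source
    Y_B2 = U @ G_U                               # estimate of the second source
    return Y_B1, Y_B2

# Toy usage with stand-in data; in the text Y_B comes from STFT 834 and mapping 835.
rng = np.random.default_rng(1)
W_tilde = rng.random((64, 30))
Y_B = rng.random((64, 200))
Y_B1, Y_B2 = semi_supervised_nmf(Y_B, W_tilde, K_u=15)
```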

The steps 833, 834, 835, 836, 837, 838, 839, 840, 841 and 842 in FIG. 8B are an example of an improved machine learning algorithm that utilizes a stored $\tilde{W}$ matrix that is determined using any of the techniques described herein, including those involving multi-sensor or multi-channel applications as well as those involving multi-resolution, e.g., involving bases matrices that include bases that represent multiple time-frequency resolutions.

In another embodiment, the training dictionary $\tilde{W}$ can be used in recording studio applications for extracting source signals that have been captured in signal mixtures. For example, assume that a song arrangement contains a guitar (1 sound source), bass (1 sound source), drums (8 sound sources) and piano (1 sound source), and that four musicians (guitar player, bass player, piano player and drummer) simultaneously perform the song in the studio. The sound engineer can place N microphones to capture the song, and unavoidably each microphone captures the sound of all musicians. In the prior art there is nothing that the engineer could do in order to completely isolate the sound of each musician in each microphone.

However, in this embodiment, special recordings for dictionary extraction (see block 800 of FIG. 8B) can precede the actual song recording. During this phase the musicians can provide isolated recordings of their instruments. The recordings can be made inside the main Digital Audio Workstation (D.A.W.) or using any other software or hardware module. Then we can process these recordings in multiple time-frequency resolutions in order to derive and store the dictionary $\tilde{W}$ (800), according to one exemplary embodiment.

In additional embodiments, the extraction of the dictionary (800) can be implemented inside a D.A.W. or in an external hardware unit(s). The sound engineer can, for example, decide which instruments will be recorded in isolation and therefore which instruments will be taken into account when forming the dictionary (800). The number of time-frequency resolutions can also be set by the sound engineer (user) according to the complexity of the task. After the training phase, the dictionary can be stored and normal recordings of the song can be made. Then any real-time or offline source separation technique (for example an NMF technique) can be used in order to process the microphone signals and extract isolated sources from the signal. The sound engineer can then use the new, never-before-available isolated signals in order to create the desired song mix.

In another embodiment one can create a new matrix by combining the spectrograms $V_p$:

$\tilde{V} = [V_1\; V_2\; \ldots\; V_P]$  (21)

where $\tilde{V} \in \mathbb{R}_+^{F_B \times T'}$ and $T' = \sum_{p=1}^{P} T_p$. The matrix $\tilde{V}$ is the combination of the training data transformations in the common time-frequency domain. $\tilde{V}$ contains combined information about the training data in various time-frequency resolutions. Applying NMF to $\tilde{V}$, a variant version of the $\tilde{W}$ described in (19) is obtained, which can be used as a training dictionary that takes into account different time-frequency resolutions. Note that in other embodiments, any appropriate method can be used to extract $\tilde{W}$ from $\tilde{V}$ and they are all within the scope of the present disclosure.
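
A brief sketch of this alternative (eq. (21)), again using scikit-learn's NMF as one possible engine and illustrative sizes:

```python
import numpy as np
from sklearn.decomposition import NMF

def dictionary_from_stacked(V_list, K, seed=0):
    """Alternative to per-resolution NMFs: concatenate the mapped spectrograms
    along time (eq. 21) and run a single NMF on the result."""
    V_tilde = np.hstack(V_list)               # (F_B, T_1 + ... + T_P)
    model = NMF(n_components=K, init="random", max_iter=500, random_state=seed)
    W_tilde = model.fit_transform(V_tilde)    # (F_B, K); activations in model.components_
    return W_tilde

# Stand-ins for V_1, V_2 on a 64-band common grid.
rng = np.random.default_rng(2)
W_tilde = dictionary_from_stacked([rng.random((64, 120)), rng.random((64, 30))], K=25)
print(W_tilde.shape)                          # (64, 25)
```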

FIG. 9 presents an alternate embodiment of the method presented herein. The training data are obtained 901. A first STFT transform with window length L₁ is applied 902 to provide the magnitude spectrogram X₁. A second STFT transform with window length L₂ is applied 903 on the same training data to provide the magnitude spectrogram X₂. The process is repeated until the final STFT transform with window length $L_P$ is applied 904 to provide the magnitude spectrogram $X_P$. Then an appropriate frequency mapping is performed with a different mapping matrix $B_p$ for each STFT 905, 906, 907 to get the set of spectrograms $V_p$. The spectrograms $V_p$ are combined 908 and an NMF is performed on $\tilde{V}$ 909 to obtain the dictionary $\tilde{W}$ 910. In other embodiments, the training methods presented in FIGS. 8 and 9 can be combined with each other and/or with any other training method, resulting in hybrid training approaches.

FIG. 9A shows an exemplary embodiment for the case of P=2. A training signal 921 is captured in the time-domain. A first STFT transform with window length L₁ 922 is applied to the signal to produce the first spectrogram X₁ with a specific time-frequency resolution. A second STFT transform with window length L₂ 923 is applied on the same signal to produce a second spectrogram X₂ with a different time-frequency resolution. These spectrograms are mapped 924, 925 by frequency mapping matrices B₁ and B₂ to a common frequency domain with $F_B$ bands and the spectrograms V₁, V₂ are produced. V₁, V₂ are combined to produce $\tilde{V}$, which is a matrix that describes the training data with different time-frequency resolutions. An NMF 926 is applied on $\tilde{V}$ to produce $\tilde{W}$. Matrix $\tilde{W}$ describes the spectral properties of the training data with different time-frequency resolutions.

In other embodiments, the multiple time-frequency resolution training method can be extended for multiple sources and microphones. One assumes that a set of training signals $x_m^n(k)$ for n=1, 2, . . . , N sources and m=1, 2, . . . , M microphones is available. Each training signal $x_m^n(k)$ can be a recording of the n-th source in the m-th microphone without any other sources being active or present. A set of magnitude spectrograms $X_m^{n,p}$ in some domain δ can be obtained similarly to (2). Each spectrogram $X_m^{n,p}$ is the result of a different time-frequency transformation with p=1, 2, . . . , P, where P is the total number of transformations. $X_m^{n,p}$ represents the training data for the n-th source in the m-th microphone as described by the p-th time-frequency resolution. One can also construct a set of frequency mapping matrices $B_p$. One can then have a set of spectrograms $V_m^{n,p} \in \mathbb{R}_+^{F_B \times T_p}$. Similarly to (18), the set of spectrograms $V_m^{n,p}$ represent the training data for the n-th source as captured by the m-th microphone and the p-th time-frequency resolution, in the common time-frequency domain after the mapping provided by the mapping matrix $B_p$. One can combine the spectrograms $V_m^{n,p}$ for the n-th source as:

$V^{n,p} = [(V_1^{n,p})^T\; (V_2^{n,p})^T\; \ldots\; (V_M^{n,p})^T]^T$  (22)

The matrix $V^{n,p} \in \mathbb{R}_+^{M F_B \times T_p}$ describes the training data for the n-th source with the p-th time-frequency resolution as captured by all microphones. In one embodiment, one can perform one NMF per matrix $V^{n,p}$ and obtain a set of matrices $W^{n,p} \in \mathbb{R}_+^{M F_B \times K_n}$, which can then be combined as:

$\tilde{W}^n = [W^{n,1}\; W^{n,2}\; \ldots\; W^{n,P}]$  (23)

where $\tilde{W}^n \in \mathbb{R}_+^{M F_B \times P K_n}$. Note that in other embodiments, any appropriate method can be used to extract $W^{n,p}$ from $V^{n,p}$ and all are within the scope of the present disclosure. The matrix $W^{n,p}$ is a set of $K_n$ basis functions that model the n-th source with the p-th time-frequency resolution in the common frequency domain provided by the mapping matrix $B_p$. The matrix $\tilde{W}^n$ combines all of the basis functions for all time-frequency resolutions and microphones. $\tilde{W}^n$ is a global model of the n-th source.
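
The multi-microphone, multi-resolution construction of (22)-(23) might be sketched as follows; the nesting of the input lists, the sizes and the function name are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

def multimic_multires_dictionary(V_mnp, K_n, seed=0):
    """V_mnp[p][m]: mapped spectrogram of the n-th source at microphone m and
    resolution p, each of shape (F_B, T_p). Stack microphones (eq. 22), run one
    NMF per resolution, and concatenate the bases (eq. 23)."""
    W_parts = []
    for V_per_mic in V_mnp:
        V_np = np.vstack(V_per_mic)                  # (M * F_B, T_p)
        model = NMF(n_components=K_n, init="random", max_iter=500, random_state=seed)
        W_parts.append(model.fit_transform(V_np))    # (M * F_B, K_n)
    return np.hstack(W_parts)                        # W~^n: (M * F_B, P * K_n)

# Toy usage: M=2 microphones, P=2 resolutions, F_B=64 common bands.
rng = np.random.default_rng(3)
V = [[rng.random((64, 120)), rng.random((64, 120))],  # p = 1, mics 1..M
     [rng.random((64, 30)),  rng.random((64, 30))]]   # p = 2, mics 1..M
W_n = multimic_multires_dictionary(V, K_n=10)
print(W_n.shape)                                      # (128, 20)
```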

In another embodiment one can combine the spectrograms $V^{n,p}$ with one another as follows:

$\tilde{V}^n = [V^{n,1}\; V^{n,2}\; \ldots\; V^{n,P}]$  (24)

where $\tilde{V}^n \in \mathbb{R}_+^{M F_B \times T'}$ and $T' = \sum_p T_p$. The matrix $\tilde{V}^n$ describes the training data of the n-th source as captured by all microphones for all available time-frequency resolutions, in the common time-frequency domain provided by the mapping matrix $B_p$. Applying NMF on $\tilde{V}^n$ can provide a variant version of the $\tilde{W}^n$ in (23).

Whether $\tilde{W}^n$ is calculated from the NMF of $\tilde{V}^n$ or from the combination of the $W^{n,p}$ obtained by individual NMFs on $V^{n,p}$, it has the same interpretation. $\tilde{W}^n$ is an expanded version of the $W^n$ described in (14). While $W^n$ contains basis functions that model the n-th source in all microphones, it is limited to a specific time-frequency resolution and models only a subset of the source properties. $\tilde{W}^n$ combines basis functions and models the n-th source using different time-frequency resolutions and hence provides a more complete model of the source.

Note that the steps 833, 834, 835, 836, 837, 838, 839, 840, 841 and 842 in FIG. 8B can be logically expanded to accommodate the use of $\tilde{W}^n$ described above. That is, while not shown in FIG. 8B, $\tilde{W}^n$ can be used to separate sources when both multi-sensor and multi-resolution audio data are available.

In another embodiment, the same principle of using multiple time-frequency resolutions simultaneously can be used to extract feature vectors that improve the training and performance of machine learning algorithms. Consider a time-domain signal x(k) that will be used as an input to a machine learning algorithm. The first step in using any machine learning method is to extract a set of features that describe this signal. These features are typically arranged in vector form. In the case of audio signals, such features are commonly extracted in the time-frequency domain. Therefore, in another embodiment, multiple time-frequency resolutions are used in order to extract a set of features for each time-frequency resolution, and the sets are combined into an extended feature vector.
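
As one hedged example of such an extended feature vector (not prescribed by the disclosure), the sketch below averages band magnitudes over time for each STFT resolution and concatenates the results; the band layout and the averaging are arbitrary illustrative choices.

```python
import numpy as np
from scipy.signal import stft

def extended_features(x, fs, window_lengths, n_bands=32):
    """Concatenate one feature set per time-frequency resolution: here the features
    are simply time-averaged band magnitudes; any per-resolution features would do."""
    feats = []
    for L in window_lengths:
        _, _, Z = stft(x, fs=fs, nperseg=L, noverlap=L // 2)
        X = np.abs(Z)                                  # (L // 2 + 1, T)
        edges = np.linspace(0, X.shape[0], n_bands + 1).astype(int)
        bands = [X[edges[b]:max(edges[b] + 1, edges[b + 1])].mean() for b in range(n_bands)]
        feats.append(np.asarray(bands))                # n_bands features per resolution
    return np.concatenate(feats)                       # extended feature vector

fs = 16000
x = np.random.randn(fs)                                # stand-in for an audio clip
phi = extended_features(x, fs, window_lengths=[256, 1024])
print(phi.shape)                                       # (64,)
```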

Aspects of the technology thus at least relate to:

A method for improving the separation of audio sources comprising:

obtaining first data from a training source signal in a sensor;

transforming the first data to the time-frequency domain using a first window length and obtaining a first representation;

transforming the first data to the time-frequency domain using a second window length and obtaining a second representation;

determining elements of a training dictionary using one or more signal processing algorithms from the first and second representations;

wherein the first and second window lengths are different;

storing the training dictionary elements;

using the training dictionary elements to process second data obtained by the sensor; and

audibly outputting a signal related to the processed second data.

Any one or more of the above aspects, wherein the first and second representations are mapped to a time-frequency domain with common frequency resolution before determining the training dictionary elements.

Any one or more of the above aspects, where the source signal is a single channel or binaural or multichannel audio signal.

Any one or more of the above aspects, where the signal processing algorithms are one or more of non-negative matrix factorization, non-negative tensor factorization, independent component analysis, principal component analysis, singular value decomposition, dependent component analysis, low-complexity coding and decoding, stationary subspace analysis, common spatial pattern, empirical mode decomposition, tensor decomposition, canonical polyadic decomposition, higher-order singular value decomposition, and Tucker decomposition.

Any one or more of the above aspects, where the training dictionary is used for source separation.

Any one or more of the above aspects, where the representations can be obtained with any one or more of a short-time Fourier transform (STFT), a wavelet transform, a polyphase filterbank, a multi-rate filterbank, a quadrature mirror filterbank, a warped filterbank, an auditory-inspired filterbank, a tree-structured array of filterbanks, etc.

Any one or more of the above aspects, where the data are captured in live or studio music events from one or more microphones.

A method for improving the separation of audio sources comprising:

capturing one or more sound sources from two or more microphones and creating a first set of two or more time-domain signals;

storing the first set of time-domain signals;

removing silence from the first set of time domain signals;

transforming the first set of time domain signals via a time-frequency transform and creating two or more representations;

stacking the representations and creating a new representation;

extracting training dictionary elements using one or more signal processing algorithms from the new representation;

storing the training dictionary elements;

using the training dictionary elements to process a second set of two or more time-domain signals obtained by the two or more microphones; and

audibly outputting the processed second set of time domain signals.

Any one or more of the above aspects, where the time-domain signals are single channel or binaural or multichannel audio signals.

Any one or more of the above aspects, where the signal processing algorithms are one or more of non-negative matrix factorization, non-negative tensor factorization, independent component analysis, principal component analysis, singular value decomposition, dependent component analysis, low-complexity coding and decoding, stationary subspace analysis, common spatial pattern, empirical mode decomposition, tensor decomposition, canonical polyadic decomposition, higher-order singular value decomposition, and Tucker decomposition.

Any one or more of the above aspects, where the training dictionary is used for source separation.

Any one or more of the above aspects, where sound sources are captured in live or studio music events.

A system that improves the separation of audio sources comprising:

two or more microphones that capture one or more sound sources;

a transform that creates a first set of two or more time-domain signals;

memory adapted to store the first set of time-domain signals;

a processor adapted to remove silence from the first set of time domain signals;

a transformer that transforms the first set of time domain signals via a time-frequency transform and creates two or more representations;

one or more signal processing algorithms that stack the representations, create a new representation and extract training dictionary elements from the new representation;

storage that stores the training dictionary elements;

the training dictionary elements used to process a second set of two or more time-domain signals obtained by the two or more microphones; and

at least one speaker that audibly outputs the processed second set of time domain signals.

Any one or more of the above aspects, where the time-domain signals are single channel or binaural or multichannel audio signals.

Any one or more of the above aspects, where the signal processing algorithms are one or more of non-negative matrix factorization, non-negative tensor factorization, independent component analysis, principal component analysis, singular value decomposition, dependent component analysis, low-complexity coding and decoding, stationary subspace analysis, common spatial pattern, empirical mode decomposition, tensor decomposition, canonical polyadic decomposition, higher-order singular value decomposition, and Tucker decomposition.

Any one or more of the above aspects, where the training dictionary is used for source separation.

Any one or more of the above aspects, where sound sources are captured in live or studio music events.

A system for improving the separation of audio sources comprising:

means for obtaining first data from a training source signal in a sensor;

means for transforming the first data to the time-frequency domain using a first window length and obtaining a first representation;

means for transforming the first data to the time-frequency domain using a second window length and obtaining a second representation;

means for determining elements of a training dictionary using one or more signal processing algorithms from the first and second representations, wherein the first and second window lengths are different;

means for storing the training dictionary elements;

means for using the training dictionary elements to process second data obtained by the sensor; and

means for audibly outputting a signal related to the processed second data.

One or more means to implement any one or more of the above aspects.

A non-transitory computer-readable information storage media having stored thereon instructions that, when executed by one or more controllers/processors, cause to be performed the method in any one or more of the above aspects.

While the above-described flowcharts have been discussed in relation to a particular sequence of events, it should be appreciated that changes to this sequence can occur without materially affecting the operation of the invention. Additionally, the exemplary techniques illustrated herein are not limited to the specifically illustrated embodiments but can also be utilized and combined with the other exemplary embodiments, and each described feature is individually and separately claimable.

While the above-described embodiments and flowcharts have focused on an exemplary application involving audio signals, and hence often use terms such as sound source and microphone, it is to be understood that the methods are applicable to processing data originating from any communications source as well, including any wired or wireless signal. It is also meant to be understood that the sensor can be any device that can receive or perceive the source signal, such as a communications receiver, a modem or the like. Thus the methods described above also apply in multi-user or multi-transceiver communications systems where multiple data signals (which may include reference or training data, which is typically known, as well as user data, which is typically meant to be communicated as information) are exchanged between transmitters and receivers and where (i) the communications paths between each transmitter-receiver pair are taken into consideration in a multi-user or multi-transceiver environment and (ii) different time-frequency resolutions can be utilized on data signals communicated between each transmitter-receiver pair to capture different spectral characteristics of the data signal.

In such communications systems, the training signals could be reference signals or signals transmitted and/or received during an initialization phase, and the non-training signals can be steady-state or other signals transmitted/received during information exchange between transceiver devices. The multiple transmitters are the sources, and the multiple receivers are the sensors. Machine learning algorithms would take advantage of the multi-sensor, multi-channel nature of such a multi-user communications system to improve multi-user performance (also known as multiple-input multiple-output, MIMO, systems) in the presence of noise and crosstalk (i.e., the disruption caused between users), using techniques similar to the ones described above (for multi-sensors) and below (for multi-resolution). In this case, the W matrices (stored dictionary matrix 20 in FIG. 10) represent the spectral properties of each transmitter (as received in each receiver and at one or more time-frequency resolutions) and the output signals $y_n$ (60, 61 and 63 in FIG. 11) are communications signals, each representing aspects of estimates of an individually transmitted (i.e., separated) signal $x_m$.

Additionally, the systems, methods and protocols of this invention can be implemented on a special purpose computer, a programmed microprocessor or microcontroller and peripheral integrated circuit element(s), an ASIC or other integrated circuit, a digital signal processor, a hard-wired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, PAL, a modem, a transmitter/receiver, any comparable means, or the like. In general, any device capable of implementing a state machine that is in turn capable of implementing the methodology illustrated herein can be used to implement the various communication methods, protocols and techniques according to this invention.

Furthermore, the disclosed methods may be readily implemented in software using object or object-oriented software development environments that provide portable source code that can be used on a variety of computer or workstation platforms. Alternatively, the disclosed methods may be readily implemented in software on an embedded processor, a microprocessor or a digital signal processor. The implementation may utilize either fixed-point or floating-point operations or both. In the case of fixed-point operations, approximations may be used for certain mathematical operations such as logarithms, exponentials, etc. Alternatively, the disclosed system may be implemented partially or fully in hardware using standard logic circuits or VLSI design. Whether software or hardware is used to implement the systems in accordance with this invention is dependent on the speed and/or efficiency requirements of the system, the particular function, and the particular software or hardware systems or microprocessor or microcomputer systems being utilized. The systems and methods illustrated herein can be readily implemented in hardware and/or software using any known or later developed systems or structures, devices and/or software by those of ordinary skill in the applicable art from the functional description provided herein and with a general basic knowledge of the audio processing arts.

Moreover, the disclosed methods may be readily implemented in software that can be stored on a storage medium and executed on a programmed general-purpose computer with the cooperation of a controller and memory, a special purpose computer, a microprocessor, or the like. In these instances, the systems and methods of this invention can be implemented as a program embedded on a personal computer such as an applet, JAVA® or CGI script, as a resource residing on a server or computer workstation, as a routine embedded in a dedicated system or system component, or the like. The system can also be implemented by physically incorporating the system and/or method into a software and/or hardware system, such as the hardware and software systems of an electronic device.

It is therefore apparent that there has been provided, in accordance with the present invention, systems and methods for data training in multi-sensor setups. While this invention has been described in conjunction with a number of embodiments, it is evident that many alternatives, modifications and variations would be or are apparent to those of ordinary skill in the applicable arts. Accordingly, it is intended to embrace all such alternatives, modifications, equivalents and variations that are within the spirit and scope of this invention.

What is claimed is:
1. A method for improving the separation of audio sources comprising: obtaining first data from a training source signal in a sensor; transforming the first data to the time-frequency domain using a first window length and obtaining a first representation; transforming the first data to the time-frequency domain using a second window length and obtaining a second representation; determining elements of a training dictionary using one or more signal processing algorithms from the first and second representations; wherein the first and second window lengths are different; storing the training dictionary elements; using the training dictionary elements to process second data obtained by the sensor; and audibly outputting a signal related to the processed second data.
2. The method of claim 1, wherein the first and second representations are mapped to a time-frequency domain with common frequency resolution before determining the training dictionary elements.
3. The method of claim 1, where the source signal is a single channel or binaural or multichannel audio signal.
4. The method of claim 1, where the signal processing algorithms are one or more of non-negative matrix factorization, non-negative tensor factorization, independent component analysis, principal component analysis, singular value decomposition, dependent component analysis, low-complexity coding and decoding, stationary subspace analysis, common spatial pattern, empirical mode decomposition, tensor decomposition, canonical polyadic decomposition, higher-order singular value decomposition, and Tucker decomposition.
5. The method of claim 1, where the training dictionary is used for source separation.
6. The method of claim 1, where the representations can be obtained with any one or more of a short-time Fourier transform (STFT), a wavelet transform, a polyphase filterbank, a multi-rate filterbank, a quadrature mirror filterbank, a warped filterbank, an auditory-inspired filterbank, and a tree-structured array of filterbanks.
7. The method of claim 1, where the data are captured in live or studio music events from one or more microphones.
8. A method for improving the separation of audio sources comprising: capturing one or more sound sources from two or more microphones and creating a first set of two or more time-domain signals; storing the first set of time-domain signals; removing silence from the first set of time domain signals; transforming the first set of time domain signals via a time-frequency transform and creating two or more representations; stacking the representations and creating a new representation; extracting training dictionary elements using one or more signal processing algorithms from the new representation; storing the training dictionary elements; using the training dictionary elements to process a second set of two or more time-domain signals obtained by the two or more microphones; and audibly outputting the processed second set of time domain signals.
9. The method of claim 8, where the time-domain signals are single channel or binaural or multichannel audio signals.
10. The method of claim 8, where the signal processing algorithms are one or more of non-negative matrix factorization, non-negative tensor factorization, independent component analysis, principal component analysis, singular value decomposition, dependent component analysis, low-complexity coding and decoding, stationary subspace analysis, common spatial pattern, empirical mode decomposition, tensor decomposition, canonical polyadic decomposition, higher-order singular value decomposition, and Tucker decomposition.
11. The method of claim 8, where the training dictionary is used for source separation.
12. The method of claim 8, where sound sources are captured in live or studio music events.
13. A system that improves the separation of audio sources comprising: two or more microphones that capture one or more sound sources; a transform that creates a first set of two or more time-domain signals; memory adapted to store the first set of time-domain signals; a processor adapted to remove silence from the first set of time domain signals; a transformer that transforms the first set of time domain signals via a time-frequency transform and creates two or more representations; one or more signal processing algorithms that stack the representations to create a new representation and extract training dictionary elements from the new representation; storage that stores the training dictionary elements, the training dictionary elements used to process a second set of two or more time-domain signals obtained by the two or more microphones; and at least one speaker that audibly outputs the processed second set of time domain signals.
14. The system of claim 13, where the time-domain signals are single channel or binaural or multichannel audio signals.
15. The system of claim 13, where the signal processing algorithms are one or more of non-negative matrix factorization, non-negative tensor factorization, independent component analysis, principal component analysis, singular value decomposition, dependent component analysis, low-complexity coding and decoding, stationary subspace analysis, common spatial pattern, empirical mode decomposition, tensor decomposition, canonical polyadic decomposition, higher-order singular value decomposition, and Tucker decomposition.
16. The system of claim 13, where the training dictionary is used for source separation.
17. The system of claim 13, where sound sources are captured in live or studio music events.
18. The system of claim 13, wherein the output processed second set of time domain signals have improved separation.
19. The system of claim 13, wherein the representations can be obtained with any one or more of a short-time Fourier transform (STFT), a wavelet transform, a polyphase filterbank, a multi-rate filterbank, a quadrature mirror filterbank, a warped filterbank, an auditory-inspired filterbank, and a tree-structured array of filterbanks.
20. The system of claim 13, wherein an effect of an acoustic path is accounted for by the signal processing algorithms.